Getting Started
This page provides instructions for setting up and running the GovDoc Scanner project.
Prerequisites
- Node.js: v18.x or newer (recommended: v20.x)
- Git: For cloning the repository
- Gemini API Key: Required for AI-powered document processing (available from Google AI Studio)
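The prerequisites above can be checked from a terminal before cloning. The following is a minimal sketch, not part of the project: the function names node_version_ok and check_prerequisites are made up for illustration, and only the v18 minimum comes from the list above.

```shell
# Hypothetical helper: succeed if a Node.js version string (e.g. "v20.11.1")
# is v18 or newer.
node_version_ok() {
  version="${1#v}"          # strip the leading "v"
  major="${version%%.*}"    # keep everything before the first dot
  [ "$major" -ge 18 ] 2>/dev/null
}

# Hypothetical helper: report on each prerequisite without aborting the shell.
check_prerequisites() {
  if command -v node >/dev/null 2>&1 && node_version_ok "$(node --version)"; then
    echo "node: ok ($(node --version))"
  else
    echo "node: missing or older than v18"
  fi
  if command -v git >/dev/null 2>&1; then
    echo "git: ok"
  else
    echo "git: missing"
  fi
}

check_prerequisites
```

The Gemini API key is not checked here because it only becomes relevant once the .env file exists.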
Quick Start
1. Clone and Install
git clone https://github.com/flexivian/govdoc-scanner.git
cd govdoc-scanner
npm install
2. Environment Setup
cp .env.example .env
# Edit .env and add your Gemini API key:
# GEMINI_API_KEY=your_gemini_api_key_here
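Before the first run, it can help to confirm that the key was actually filled in. This sketch assumes the .env file uses plain KEY=value lines as in the example above; the function name has_gemini_key is hypothetical and not something the project provides.

```shell
# Hypothetical helper: succeed if the env file has a non-empty GEMINI_API_KEY
# that is not the placeholder from .env.example.
has_gemini_key() {
  env_file="${1:-.env}"
  [ -f "$env_file" ] || return 1
  # Take the last uncommented GEMINI_API_KEY=... line and keep the value part.
  value="$(grep -E '^GEMINI_API_KEY=' "$env_file" | tail -n 1 | cut -d= -f2-)"
  [ -n "$value" ] && [ "$value" != "your_gemini_api_key_here" ]
}

if has_gemini_key .env; then
  echo "GEMINI_API_KEY is set"
else
  echo "GEMINI_API_KEY is missing or still the placeholder"
fi
```

This only checks that a value is present. It does not verify that the key is accepted by the Gemini API.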
3. First Run
Test with the interactive CLI (recommended):
npm start govdoc
# Follow the interactive prompts to:
# - Choose input method (file, manual, or random)
# - Process companies with automated workflow
# - View progress and results
Project Applications
The project includes three main applications:
- CLI Tool (govdoc): Complete end-to-end interactive workflow (recommended for most users)
- Crawler (crawler): Search and download documents from GEMI portal with enhanced date extraction
- Doc-Scanner (scanner): Process documents with AI-powered chronological analysis, representative tracking, and automatic change detection
Individual Application Usage
CLI Tool (Recommended)
npm start govdoc
# Interactive mode with guided prompts:
# - File input, manual entry, or random selection
# - Automated crawling and document processing
# - Progress tracking and comprehensive summaries
# - Output saved to ./output/ at project root
Command line mode for automation:
# Process from file
npm start govdoc -- --input ./companies.gds
# Process random companies
npm start govdoc -- --company-random 10
# Show help
npm start govdoc -- --help
If you prefer to run each step separately (crawler -> scanner), set LOG_LEVEL=DEBUG to get detailed output from the individual apps.
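One way to set the variable for a single run only is a small wrapper around env. This is a sketch under the assumption that the apps read LOG_LEVEL from the process environment; the wrapper name run_debug is hypothetical.

```shell
# Hypothetical wrapper: run any command with LOG_LEVEL=DEBUG set for that
# command only, leaving the current shell's environment untouched.
run_debug() {
  env LOG_LEVEL=DEBUG "$@"
}

# Demonstration: the variable is visible to the child process.
run_debug sh -c 'echo "LOG_LEVEL=$LOG_LEVEL"'
```

With this in place, the separate apps could be started as run_debug npm start crawler and run_debug npm start scanner.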
Crawler
npm start crawler
# Search for companies or download by GEMI ID
# Downloads are saved to apps/crawler/src/downloads/, and ids.txt is written to apps/crawler/src/ids.txt
Doc-Scanner
npm start scanner
# Process documents from input directory
# Requires manual document placement in apps/doc-scanner/src/data/input/
# Important: Name files with date prefixes (YYYY-MM-DD) for chronological processing
# Features:
# - Intelligent processing: skips documents that are already up to date
# - Change tracking: automatically summarizes significant changes between versions
# - Comprehensive metadata with representative tracking and ownership history
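Because the scanner relies on date prefixes for chronological processing, it may be worth checking file names before a run. The sketch below assumes only that a valid name starts with YYYY-MM-DD; whatever follows the date is not checked, and the function names has_date_prefix and list_unprefixed are hypothetical.

```shell
# Hypothetical helper: succeed if a file name starts with a YYYY-MM-DD prefix.
has_date_prefix() {
  case "$(basename "$1")" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: print the names of files in a directory that lack the prefix.
list_unprefixed() {
  for path in "$1"/*; do
    [ -f "$path" ] || continue
    has_date_prefix "$path" || echo "$(basename "$path")"
  done
}

# Prints nothing if every file is prefixed or the directory does not exist.
list_unprefixed apps/doc-scanner/src/data/input
```

The pattern checks the shape of the prefix only, so an impossible date such as 2024-13-45 would still pass.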
Output Structure
After processing, find results in:
output/
├── 123204604000/
│   ├── 123204604000_final_metadata.json # Comprehensive company metadata with tracked changes
│   └── document_downloads/
│       └── *.pdf, *.docx files
└── govdoc-output.json # Summary
The metadata file includes:
- Current Snapshot: Latest company state with all extracted information
- Tracked Changes: Document-by-document history of significant changes
- Representative Tracking: Complete ownership and role evolution
- Change Summaries: Human-readable summaries of key modifications
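To see which companies have a metadata file after a run, the layout above can be walked with a short loop. This sketch assumes only the directory and file naming shown in the tree (one directory per company ID containing <ID>_final_metadata.json); the function name list_company_metadata is hypothetical, and the JSON contents are not parsed.

```shell
# Hypothetical helper: print "<company id><TAB><metadata path>" for each
# company directory under the output root.
list_company_metadata() {
  output_dir="${1:-output}"
  for dir in "$output_dir"/*/; do
    [ -d "$dir" ] || continue
    id="$(basename "$dir")"
    metadata="${dir}${id}_final_metadata.json"
    if [ -f "$metadata" ]; then
      printf '%s\t%s\n' "$id" "$metadata"
    else
      printf '%s\t%s\n' "$id" "(no metadata file)"
    fi
  done
}

# Prints nothing if the output directory does not exist yet.
list_company_metadata output
```

Top-level files such as govdoc-output.json are skipped because the glob matches directories only.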
Next Steps
- Check Development Setup for advanced configuration
- Explore Code Examples for usage patterns
- Review GSoC 2025 Overview for project background
Troubleshooting
- API Key Issues: Ensure a valid Gemini API key is set in the .env file
- Browser Issues: Run npx playwright install chromium
- Permissions: Check write access to output directories
- Debug Mode: Use the LOG_LEVEL=DEBUG environment variable for detailed logging and troubleshooting
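The permissions check can be done with the shell's own file tests. This is a small sketch, assuming the directories of interest are the output locations named earlier on this page; the function name check_writable is hypothetical.

```shell
# Hypothetical helper: report whether a directory exists and is writable by
# the current user. Returns non-zero when it is not.
check_writable() {
  dir="$1"
  if [ -d "$dir" ] && [ -w "$dir" ]; then
    echo "$dir: writable"
  else
    echo "$dir: missing or not writable"
    return 1
  fi
}

# Check the output locations mentioned above, continuing past any failure.
for dir in output apps/crawler/src/downloads; do
  check_writable "$dir" || true
done
```

A directory reported as missing may simply not have been created yet, for example before the first run.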